We are living through one of the most globally significant periods in recent memory because of the COVID-19 pandemic. I wanted to look at the statistics that other countries are reporting and see what can be inferred about each country's response from those statistics alone.
This is the original dataset, which contains COVID statistics for 224 countries, including Total Cases, Deaths, Recovered Cases, etc. You can view where it came from here: COVID DataSet
summary(df)
## Country, Other Total Cases Total Deaths Total Recovered
## Length:224 Length:224 Length:224 Length:224
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Active Cases Serious / Critical Condition Total Cases / 1M Population
## Length:224 Length:224 Length:224
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Deaths / 1M Population Total Tests Tests / 1M Population
## Length:224 Length:224 Length:224
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Population
## Length:224
## Class :character
## Mode :character
The original dataset is too large to visualize usefully, so I restricted it to the ten countries with the most Total Cases: USA, India, Brazil, UK, Russia, France, Turkey, Iran, Argentina and Colombia.
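The `summary(df)` output shows every column stored as character, so the counts have to be converted to numbers before the countries can be ranked. The original does not show how `top10C` was built; what follows is only a minimal base R sketch of one way it might be done, assuming the raw values are comma-formatted strings such as "1,234". The helpers `to_number` and `top_n_by_cases` and the toy data frame `raw` (made-up countries and counts) are hypothetical.

```r
# Hypothetical helper: strip thousands separators, then convert to numeric.
# Assumes values look like "45,000"; other formats would need extra handling.
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Toy stand-in for the raw data (made-up values, not from the dataset).
raw <- data.frame(
  country = c("A", "B", "C", "D"),
  total_cases = c("1,200", "45,000", "300", "7,850"),
  stringsAsFactors = FALSE
)

raw$total_cases <- to_number(raw$total_cases)

# Sort by case count, largest first, and keep the first n rows.
top_n_by_cases <- function(d, n) head(d[order(-d$total_cases), ], n)

top2 <- top_n_by_cases(raw, 2)
print(top2)
```

With the real data, the same idea would be applied to each count column and `n` would be 10.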
summary(top10C)
## Country, Other Total Cases Total Deaths Total Recovered
## Argentina:1 Min. : 4936052 Min. : 60903 Min. : 4682704
## Brazil :1 1st Qu.: 5725558 1st Qu.:115939 1st Qu.: 5292097
## Colombia :1 Median : 7074626 Median :130294 Median : 6357544
## France :1 Mean :14088938 Mean :258543 Mean :12488119
## India :1 3rd Qu.:17636515 3rd Qu.:382167 3rd Qu.:16778642
## Iran :1 Max. :42634054 Max. :688486 Max. :32598424
## (Other) :4
## Active Cases Serious / Critical Condition Total Cases / 1M Population
## Min. : 33630 Min. : 542 Length:10
## 1st Qu.: 244267 1st Qu.: 1214 Class :character
## Median : 391220 Median : 2150 Mode :character
## Mean :1342276 Mean : 5762
## 3rd Qu.: 576296 3rd Qu.: 7984
## Max. :9597842 Max. :25209
##
## Deaths / 1M Population Total Tests Tests / 1M Population
## Min. : 318 Min. : 24252818 Length:10
## 1st Qu.:1346 1st Qu.: 36866602 Class :character
## Median :1872 Median :108105076 Mode :character
## Mean :1723 Mean :198975288
## 3rd Qu.:2347 3rd Qu.:262773530
## Max. :2749 Max. :615684393
##
## Population
## Min. :4.570e+07
## 1st Qu.:6.617e+07
## Median :8.536e+07
## Mean :2.492e+08
## 3rd Qu.:1.973e+08
## Max. :1.396e+09
##
I wanted to analyze the Top Ten countries to see how COVID has affected them, how they have responded, and what we can infer from that. The biggest limitation is that the data contains only one categorical variable, the country name, which rules out many kinds of graphs and leaves bar and pie charts as the only real options.
This first graph shows the number of cases in each of the Top Ten countries. It's clear that the USA, India and Brazil have far more cases than the other countries. Part of that is presumably because they are also the three most populous of these ten countries.
library(ggplot2)
# round_any() comes from plyr and comma() from scales; both are called by namespace here
# Upper axis limit: the largest case count rounded up to the next 50 million
max_y <- plyr::round_any(max(top10C$`Total Cases`), 50000000, ceiling)
# Columns are referenced by name inside aes(), not through top10C$
fig <- ggplot(top10C, aes(x = reorder(`Country, Other`, `Total Cases`, sum), y = `Total Cases`)) +
  geom_col(color = "black", fill = "darkblue") +
  coord_flip() +
  labs(title = "Total Covid Cases by Country (Top 10)", x = "Country", y = "Total Cases") +
  theme_grey() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = scales::comma(`Total Cases`)), hjust = -0.1, size = 4) +
  scale_y_continuous(labels = scales::comma, breaks = seq(0, max_y, by = 10000000), limits = c(0, max_y))
fig
This second graph shows the total number of tests each of the Top Ten countries has carried out. What is interesting is that while the USA and India are still first and second in number of tests, Brazil, which has the third highest number of cases, is only seventh in testing. This could mean that Brazil's true case count is even higher, since it isn't testing nearly as aggressively as its peers. It could also mean that the countries with fewer cases but more tests are over-testing.
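One way to make this comparison concrete is to rank each country by cases and by tests, look at the gap between the two ranks, and compute the number of tests per confirmed case. The sketch below runs on made-up numbers, not the dataset's values; the data frame `toy` and its column names are assumptions for illustration only.

```r
# Toy data (made-up values, not from the dataset).
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  cases = c(400, 300, 200, 100),
  tests = c(4000, 3500, 500, 2000)
)

# Rank 1 = largest value.
toy$case_rank <- rank(-toy$cases)
toy$test_rank <- rank(-toy$tests)

# Positive gap: the country ranks lower on testing than it does on cases.
toy$rank_gap <- toy$test_rank - toy$case_rank

# Tests carried out per confirmed case.
toy$tests_per_case <- toy$tests / toy$cases

print(toy)
```

In this toy example, country C is third in cases but last in tests, which is the same kind of pattern described for Brazil.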
# Upper axis limit: the largest test count rounded up to the next 700 million
max_y <- plyr::round_any(max(top10C$`Total Tests`), 700000000, ceiling)
# Bars stay ordered by Total Cases so each country's testing can be read against its case rank
fig4 <- ggplot(top10C, aes(x = reorder(`Country, Other`, `Total Cases`, sum), y = `Total Tests`)) +
  geom_col(color = "black", fill = "darkgreen") +
  labs(title = "Total Covid Tests by Country (Top 10)", x = "Country", y = "Total Tests") +
  theme_grey() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = scales::comma(`Total Tests`)), vjust = -0.5, size = 4) +
  scale_y_continuous(labels = scales::comma, breaks = seq(0, max_y, by = 100000000), limits = c(0, max_y))
fig4
This pie chart shows Deaths per 1M population for each country. Two of the highest values belong to Colombia and Argentina, even though they have the fewest cases in the Top Ten. This is because the measure is relative to population: although they may not be losing as many people in absolute terms as other countries, their populations are much smaller, so they are losing a larger share of their people. Countries with significantly larger populations are losing less of their overall population.
Brazil is the most interesting case, however. It is at the top of this list despite being third in cases, meaning it is losing a significant share of its population, whereas India is losing only a small fraction of its total population.
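Deaths per 1M population is total deaths divided by population, scaled to one million people. The sketch below uses two made-up countries to show how a country with fewer deaths can still have the higher rate; `deaths_per_million` is a hypothetical helper, not a function from the original analysis, and the inputs are illustrative rather than taken from the dataset.

```r
# Hypothetical helper: deaths scaled to a population of one million.
deaths_per_million <- function(deaths, population) deaths / population * 1e6

# Made-up inputs: a small country with fewer deaths, a large one with more.
small <- deaths_per_million(deaths = 100000, population = 50e6)
large <- deaths_per_million(deaths = 400000, population = 1000e6)

# The small country has a quarter of the deaths but five times the rate.
print(round(c(small = small, large = large)))
```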
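library(plotly)
# Rows 1 and 2 are taken to be the USA and India; this assumes top10C is sorted by Total Cases, descending
USdf <- top10C[1, ]
Idf <- top10C[2, ]
# Each slice is a country's share of the summed Deaths / 1M values
fig1 <- plot_ly() %>%
  layout(title = "Deaths/1M of Top 10 Countries in Total cases of COVID") %>%
  add_trace(
    data = top10C,
    labels = ~`Country, Other`,
    values = ~`Deaths / 1M Population`,
    type = "pie",
    textposition = "outside",
    hovertemplate = "Country: %{label}<br>Percent: %{percent}<br>Deaths/1M: %{value}<extra></extra>"
  )
fig1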
This is the most interesting and important part of the analysis. These two pie charts show the differences between the two countries with the highest number of cases: the US and India. The charts consist of Recoveries, Active Cases and Deaths, and also show the Total Population, Total Cases and Total Tests for context.
The first is for the USA. Over 75% of cases ended in recovery, about 22% are still active, and roughly 2% ended in death. These numbers are about what one would expect, since the total number of tests is about twice the population, which suggests the USA is actively testing people in order to diagnose COVID-19.
fig2 <- plot_ly(hole = 0.4) %>%
  layout(title = "USA Statistics") %>%
  layout(annotations = list(
    text = paste0(
      "Total Cases: \n",
      scales::comma(USdf$`Total Cases`), "\n", "\n",
      "Population: \n",
      scales::comma(USdf$Population), "\n", "\n",
      "Total Tests: \n",
      scales::comma(USdf$`Total Tests`)
    ),
    showarrow = FALSE
  )) %>%
  add_trace(
    data = USdf,
    values = c(USdf$`Active Cases`, USdf$`Total Recovered`, USdf$`Total Deaths`),
    labels = c("Active Cases", "Recovered", "Deaths"),
    type = "pie",
    textposition = "inside",
    hovertemplate = "Country: USA<br>%{label}<br>Percent: %{percent}<br>Total: %{value}<extra></extra>"
  )
fig2
The second is far more interesting. Without the context of population and testing, it would seem as though India is doing an incredible job of keeping its population safe. With that context, however, the picture is less clear. India is one of the most populous countries in the world at ~1.4 billion people, yet it has only 547M tests, a bit over a third of its total population. What this says to me is that India either isn't testing people as much as it should be, or isn't reporting all of its tests so that its numbers keep looking good.
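The "a bit over a third" figure can be checked with simple arithmetic. The sketch below uses the rounded values quoted in the text (547M tests, ~1.4 billion people), so the result is approximate; `tests_per_person` is a hypothetical helper introduced only for this check.

```r
# Hypothetical helper: number of tests per inhabitant.
tests_per_person <- function(tests, population) tests / population

# Rounded figures quoted in the text, not the exact dataset values.
india <- tests_per_person(tests = 547e6, population = 1.4e9)

print(round(india, 2))
```

A value above 1 would mean more tests than people, as described for the USA above.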
fig3 <- plot_ly(hole = 0.4) %>%
  layout(title = "India Statistics") %>%
  layout(annotations = list(
    text = paste0(
      "Total Cases: \n",
      scales::comma(Idf$`Total Cases`), "\n", "\n",
      "Population: \n",
      scales::comma(Idf$Population), "\n", "\n",
      "Total Tests: \n",
      scales::comma(Idf$`Total Tests`)
    ),
    showarrow = FALSE
  )) %>%
  add_trace(
    data = Idf,
    values = c(Idf$`Active Cases`, Idf$`Total Recovered`, Idf$`Total Deaths`),
    labels = c("Active Cases", "Recovered", "Deaths"),
    type = "pie",
    textposition = "inside",
    hovertemplate = "Country: India<br>%{label}<br>Percent: %{percent}<br>Total: %{value}<extra></extra>"
  )
fig3
In conclusion, there is a lot we can learn from this dataset. It gives us a way to see how countries have responded to the pandemic from the numbers alone, without any societal influence. The dataset lacks several things that I believe would make it much more informative, however, most notably information about vaccines. I think it is interesting that we can take this theoretically unbiased data and still see the bias and limitations it has. We have no real way of knowing exactly how the data was collected, so we don't know which countries aren't being forthcoming, as in the case we discussed with India.